SUMMARY

Background


Cardiorespiratory  fitness  is  important  for  health,  work,  and  athletic 
performance.  Laboratory  tests  of  maximal  oxygen  uptake  (V02max)  are  the 
gold  standard  for  assessing  this  aspect  of  fitness.  V02max  protocols 
with  small  measurement  errors  will  provide  the  best  estimates  of 
relationships  between  fitness  and  its  antecedents  and  consequences. 

For  example,  tests  with  smaller  errors  will  provide  better  indications 
of  how  well  running  tests  function  as  substitutes  for  laboratory 
tests.

Objective 

Published  studies  of  the  reliability  of  V02max  tests  provide  an 
empirical  basis  for  estimating  V02max  test  precision.  This  review 
employed  meta-analysis  procedures  to  model  V02max  test  precision. 

Approach 

Studies  of  the  test-retest  reliability  of  V02max  protocols  were 
identified  from  previous  reviews  and  searches  of  computerized 
databases  for  biomedical,  behavioral,  and  sports  research.  Of  51 
studies  identified,  12  were  dropped  because  long  test-retest  intervals 
made  it  likely  that  true  V02max  values  changed  during  the  study.  The 
reported  means,  standard  deviations,  and  test-retest  correlations  were 
used  to  compute  the  standard  error  of  measurement  (SEM)  for  V02max  for 
the  remaining  39  studies.  The  age  and  gender  composition  of  the  sample 
were  coded  along  with  the  exercise  mode  (treadmill,  cycle  ergometer, 
other)  and  the  test-retest  interval  for  the  protocol.  Meta-analysis 
produced  a  predictive  model  for  SEM  based  on  sample  and  protocol 
attributes.

Results 

Average SEM was 2.58 ml • kg-1 • min-1. SEM was higher in samples with
higher average VO2max. Age, gender, test interval, and exercise mode
were not related to SEM. After allowing for outliers, the final model
to predict SEM was ln(SEM) = 0.661 + (.006 * VO2max).

Conclusions 

SEM  increases  as  the  average  V02max  of  the  sample  increases.  Other 
population  and  protocol  attributes  were  not  related  to  SEM.  The 
potential  applications  of  the  model  for  SEM  include  evaluating  new 
V02max  protocols,  evaluating  field  tests  (e.g.,  run  tests,  walk  tests), 
and  making  allowances  for  measurement  error  when  investigating  the 
relationships  of  V02max  with  other  variables. 
Introduction


Cardiorespiratory  fitness  is  important  for  health  and  work  and 
athletic  performance.  Maximal  oxygen  uptake  (V02max)  is  the  accepted 
indicator  of  this  physical  capacity.  Laboratory  tests  that  directly 
measure  oxygen  uptake  during  heavy  physical  exertion  are  the  gold 
standard  for  V02max  measurement.  These  tests,  which  can  be  performed  on 
treadmills  or  cycle  ergometers,  involve  technical  and  performance 
factors that can introduce measurement errors (Howley, Bassett, &
Welch, 1995). This paper summarizes the empirical evidence regarding
the size of those errors.

Measurement  error  biases  empirical  estimates  of  relationships 
between  V02max  and  other  variables.  The  bias  produces  estimated 
associations  that  are  less  than  the  true  population  relationships 
(Nunnally & Bernstein, 1994). The technical term for this
underestimation  is  attenuation  due  to  measurement  error.  Better 
estimates  of  population  parameters  can  be  obtained  by  adjusting  for 
this  attenuation.  The  magnitude  of  error  must  be  known  to  make  the 
necessary  corrections. 

This review examines the measurement error for VO2max tests when
the results are expressed as milliliters of oxygen uptake per kilogram
of body weight per minute (ml • kg-1 • min-1). Meta-analysis provides a
model to predict the standard error of measurement (SEM) based on the
pooled evidence from available studies.


Methods 


Data  Sources 

The  PUBMED®  computer  database  was  searched  to  identify  relevant 
studies.  The  search  keywords  were  reliability  or  reproducibility 
combined  with  maximal  oxygen  uptake  or  V02max.  The  resulting  set  of 
articles  was  augmented  with  citations  from  Safrit,  Hooper,  Ehlert, 
Costa,  and  Patterson  (1988)  and  Hopkins,  Schabort,  &  Hawley  (2001). 
The  references  in  the  articles  identified  in  these  first  2  steps  were 
examined  to  identify  additional  studies. 

The  studies  in  this  review  met  three  criteria.  First,  oxygen 
uptake  was  expressed  in  units  of  ml  •  kg-1  •  min-1 .  This  size-adjusted 
expression  is  the  most  common  index  of  cardiorespiratory  capacity  in 
studies  of  health  and  performance.  Second,  the  study  reported  SEM  or 
sufficient  information  to  compute  SEM  (i.e.,  the  standard  deviation 
and rxx or intraclass correlation [ICC] for VO2max). Third, the
test-retest interval was no more than 3 weeks.1


1Twelve  studies  met  the  first  two  criteria  but  had  retest  intervals  longer 
than  5  weeks.  Preliminary  analysis  indicated  that  SEM  was  much  larger  in 

Table 1. Descriptive Data

                        Mean    Median    Minimum    Maximum
Age                     30.3     27.2       9.8       79.8
VO2max                  45.3     46.2      13.1       68.5
Reliability (rxx)a      .890     .909      .620       .970
SDtb                    5.68     5.55      1.52       9.50
SEMc                    2.64     2.53      1.30       5.00

Note. Table entries are the weighted statistics for the raw data.
Sample size was the weighting factor. Values for rxx, SEM, and SDt
reported in the text may differ from these because the raw data were
transformed to approximate normal distributions before analysis.
aTest-retest reliability coefficient.
bStandard deviation of true scores.
cStandard error of measurement.

Thirty-one  (31)  studies  covering  39  samples  with  745  total 
participants  met  these  criteria.  The  studies  included  29  treadmill,  8 
cycle  ergometer,  and  2  miscellaneous  (e.g.,  tethered  swimming) 
protocols.  The  protocols  included  500  treadmill  tests,  187  cycle 
ergometer  tests,  and  58  miscellaneous  tests. 

Coding  Procedures 

The  mean  and  standard  deviation  for  each  test  administration  and 
the  correlation  between  scores  (i.e.,  rxx)  were  recorded  when  reported. 
When  raw  data  were  reported,  statistics  in  the  paper  were  confirmed  by 
repeating the basic data analysis. When the ICC was reported, ICC and
the number of test administrations (k) were entered into the
Spearman-Brown formula, rICC = k*rij / [1 + (k - 1)*rij], where rij is
the average correlation between VO2max values for the ith and jth test
administrations (Ghiselli, Campbell, & Zedeck, 1981, p. 232). The
formula was solved for rij, which then was the study estimate of rxx.
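
The inversion described above can be sketched as follows. This is a minimal illustration, assuming the standard Spearman-Brown form rICC = k*r / [1 + (k - 1)*r], which coincides with k*r / (1 + r) when there are two test administrations. The function name and the example values are hypothetical and are not taken from any reviewed study.

```python
def average_r_from_icc(icc: float, k: int) -> float:
    """Solve the Spearman-Brown formula for the average inter-test correlation.

    Assumes icc = k*r / (1 + (k - 1)*r), so r = icc / (k - (k - 1)*icc).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return icc / (k - (k - 1) * icc)


# Hypothetical example: an ICC of .95 based on 2 test administrations.
r_est = average_r_from_icc(0.95, 2)
print(round(r_est, 3))  # about 0.905
```

The estimate is always at or below the reported ICC, because the ICC for k administrations reflects the reliability of the averaged score rather than of a single test.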

Additional  information  was  extracted  to  examine  factors  that 
might  modify  SEM.  Gender,  age,  and  V02max  were  recorded  as  sample 
attributes  that  might  indicate  limits  on  the  generalizability  of  SEM 
(see Table 1). Study design attributes were recorded to identify
methodological  factors  that  could  be  controlled  to  minimize  SEM. 
Exercise mode was coded as treadmill (k = 29 samples, N = 500 cases),
cycle ergometer (k = 8, N = 187), and other (k = 2, N = 58).2 Only 27


those  studies  than  in  the  39  studies  retained  for  analysis.  Changes  in  true 
V02max  would  be  one  possible  explanation  for  the  large  errors.  Dropping  these 
studies  helped  ensure  that  the  review  evaluated  protocol  performance  without 
the  confounding  effects  of  changes  in  V02max. 

2The  initial  study  plan  for  the  review  included  coding  protocol  attributes 
(e.g.,  how  initial  work  rate  was  determined,  frequency  and  size  of  work  rate 
studies provided enough information to estimate test-retest interval.
Typical  descriptions  referred  to  a  range  of  times  (e.g.,  "7  to  10 
days,"  "7  or  more  days")  between  tests.  This  practice  no  doubt 
reflects  the  difficulty  of  maintaining  precise  scheduling  when 
participants  must  return  to  the  laboratory  more  than  once.  When  a 
range  was  given,  the  midpoint  of  the  range  was  recorded.  When  only  a 
lower  bound  was  given,  this  minimum  value  was  recorded.  The  initial 
interval estimates were recoded into categories: "<1 week" (k = 9, N =
177), "7-10 days" (k = 11, N = 221), and "2-3 weeks" (k = 7, N = 134).
Test interval could not be determined for 12 samples (N = 213). A
missing  data  value  was  entered  for  those  samples. 

Analysis  Procedures 

SEM and SDt were computed as follows:

SEM = sqrt(1 - rxx^2) * SD(VO2max)
SDt = rxx * SD(VO2max)
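
As a rough sketch, the two computations can be written as below. The code reads the formulas above as SEM = sqrt(1 - rxx^2) * SD and SDt = rxx * SD, which is an interpretation of the printed expressions; under that reading the two components satisfy SD^2 = SDt^2 + SEM^2. The function name and the example inputs are hypothetical.

```python
import math


def variance_components(rxx: float, sd: float) -> tuple[float, float]:
    """Return (SEM, SDt) from a test-retest correlation and an observed SD.

    Assumed reading of the text: SEM = sqrt(1 - rxx^2) * SD, SDt = rxx * SD.
    """
    if not -1.0 <= rxx <= 1.0:
        raise ValueError("rxx must lie between -1 and 1")
    sem = math.sqrt(1.0 - rxx ** 2) * sd
    sdt = rxx * sd
    return sem, sdt


# Hypothetical sample: rxx = .90, SD = 6.0 ml/kg/min.
sem, sdt = variance_components(0.90, 6.0)
print(f"SEM = {sem:.2f}, SDt = {sdt:.2f}")
# The components recombine to the observed variance.
print(math.isclose(sem ** 2 + sdt ** 2, 6.0 ** 2))
```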

These  variance  components  and  rxx  were  transformed  to  obtain  normal 
distributions  with  known  variances  (Raudenbush  &  Bryk,  2002,  p.  219, 
for  the  conversion  formulae) .  The  meta-analyses  were  conducted  by 
applying  standard  regression  and  general  linear  model  procedures 
(SPSS,  Inc.,  1998a,  1998b)  to  the  transformed  variables.  In  these 
analyses,  the  transformed  variance  component  or  correlation  was 
weighted  by  the  inverse  of  its  known  variance.  Given  this  weighting, 
the sum of squares from the analyses provided Hedges's Q (Hedges &
Olkin, 1985, pp. 241-242). The Q statistic has a χ2 distribution with
k - 1 degrees of freedom (df), where k is the number of correlations or
variance estimates being analyzed.
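
The weighting scheme just described can be illustrated with a small sketch. It assumes that each study contributes a transformed estimate with a known variance, and it computes the inverse-variance weighted mean and the homogeneity statistic Q as the weighted sum of squared deviations from that mean. The function name and the input values are hypothetical; the review itself used SPSS regression and general linear model procedures.

```python
def weighted_mean_and_q(estimates, variances):
    """Inverse-variance weighted mean and homogeneity statistic Q.

    estimates: transformed study estimates (e.g., log variance components)
    variances: known sampling variance of each transformed estimate
    Returns (weighted mean, Q, degrees of freedom = k - 1).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, estimates)) / sum(weights)
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, estimates))
    return mean, q, len(estimates) - 1


# Hypothetical transformed estimates and variances for three samples.
mean, q, df = weighted_mean_and_q([0.90, 1.05, 0.98], [0.02, 0.05, 0.03])
print(f"weighted mean = {mean:.3f}, Q = {q:.2f}, df = {df}")
```

Q would then be referred to a χ2 distribution with the returned degrees of freedom.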

Preliminary  analyses  established  two  facts  that  affected 
decisions regarding the results reported here. First, the mean and
standard deviation of the initial VO2max test were acceptable estimates
of these statistics for both test administrations (Appendix A).

Second,  rxx  was  substantially  lower  and  SEM  substantially  higher  when 
more  than  3  weeks  elapsed  between  tests.  This  difference  was  expected 
because  testing  conditions  could  change  over  the  longer  intervals. 
Possible  changes  include  alterations  in  true  V02max  scores.  The 
possibility  of  substantial  changes  in  the  person,  the  laboratory 
equipment,  seasonal  effects  on  physical  activity  and  other  factors 
would  introduce  major  elements  of  uncertainty  into  attempts  to 
evaluate SEM. Given these preliminary analyses, the results
reported here are based on the mean and standard deviation from the
first test session for studies with test intervals ≤3 weeks.


increments, criteria for a valid VO2max). Protocol details were missing from
too many studies to support the planned analysis.

Table 2. Categorical Predictors of Reliability and Precision

                        SDt       SEM        r
Test interval
  <7 days               5.39      2.56      .905
  7-10 days             4.89      2.65      .883
  2-3 weeks             5.58      3.00      .887
  χ2                    3.06      3.65      1.06
  p value               .216      .161      .588
Exercise mode
  Cycle ergometer       6.50      2.47      .936
  Treadmill             5.15      2.59      .897
  Other                 6.67      2.87      .920
  χ2                   17.06      1.96      7.71
  p value               .001      .375      .022
Gender
  Missing               4.77      2.71      .875
  Male                  6.27      2.71      .920
  Female                3.87      1.94      .898
  M + F                 7.20      3.58      .896
  χ2                   50.46     30.83      4.37
  p value               .001      .001      .225

Results 

Bivariate  Relationships 

Test Interval. SEM increased slightly, but consistently, as the
interval between tests increased, but the trend was not statistically
significant (χ2 = 3.65, 2 df, p > .161). SDt was not related to test
interval (χ2 = 3.06, 2 df, p > .216). Test-retest reliability, rxx, did
not vary (χ2 = 1.06, 2 df, p > .588).
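
The p values quoted for these three tests can be checked by hand, because a χ2 variable with 2 df has the closed-form upper-tail probability exp(-x/2). The short sketch below applies that identity to the reported statistics; the function name is hypothetical, and the closed form holds only for 2 df.

```python
import math


def chi2_upper_tail_2df(x: float) -> float:
    """Upper-tail probability of a chi-square statistic with exactly 2 df."""
    if x < 0:
        raise ValueError("chi-square statistics are non-negative")
    return math.exp(-x / 2.0)


# Statistics reported for the test-interval comparisons (SEM, SDt, rxx).
for label, stat in [("SEM", 3.65), ("SDt", 3.06), ("rxx", 1.06)]:
    print(f"{label}: chi2 = {stat:.2f}, p = {chi2_upper_tail_2df(stat):.3f}")
```

The computed values agree with the reported p values to within rounding in the third decimal place.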

The estimates of test interval effects may be biased. Studies for which
interval could not be estimated had higher average SDt (6.62 ml • kg-1 • min-1)
and lower average SEM (2.28 ml • kg-1 • min-1) compared with studies with
interval data. Test-retest reliability also was higher (rxx =
.946). The differences were statistically significant (p < .001 for
each). The missing data would bias the estimates of interval effects
if  the  studies  with  missing  data  all  had  approximately  the  same 
interval.  The  direction  and  magnitude  of  the  bias  would  depend  on 
where  the  cluster  was  located  on  the  time  continuum. 
Exercise Mode. SEM was not related to exercise mode (χ2 = 1.96, 2 df, p
> .375). SDt was lower for treadmill protocols than for cycle ergometer
and other protocols (χ2 = 17.06, 2 df, p < .001; see Table 2). Combining
these trends produced small, but statistically significant, differences
in rxx (χ2 = 7.71, 2 df, p < .022).

Gender. SEM was larger for males (2.71) than for females (1.94;
χ2 = 22.03, 1 df, p < .001). SDt was greater in samples of males (SDt =
6.27) than in samples of females (SDt = 3.87; χ2 = 46.03, 1 df, p <
.001). Because both components were larger for males, the differences
offset each other and yielded comparable rxx values for men and women
(χ2 = 1.62, 1 df, p > .204).

Age. Age was not related to SEM (r = -.15, χ2 = 2.86, 1 df, p >
.090) or SDt (r = .12, χ2 = 1.77, 1 df, p > .183). These weak opposing
trends produced increasing rxx with age (r = .27, χ2 = 4.10, 1 df, p <
.043).

VO2max. VO2max was positively related to SEM (r = .35, χ2 = 16.59, 1
df, p < .001), but was not related to SDt (r = .02, χ2 = 0.04, 1 df, p
> .841). The combined trends produced a negative relationship between
rxx and VO2max (r = -.31, χ2 = 6.07, 1 df, p < .014).

Multivariate  Model  for  SEM 

The  general  linear  model  routine  of  SPSS-PC  was  used  to  combine 
V02max  and  gender,  the  significant  bivariate  correlates  of  SEM,  into  a 
multivariate  model.  Each  variable  contributed  independently  to  the 
prediction of SEM (VO2max, χ2 = 5.94, 1 df, p < .015; gender, χ2 = 13.44,
1 df, p < .001). The regression formula was

ln(SEM) = 0.985 + (.006 * VO2max) - (.275 * Gender) (Equation 1)

Sensitivity Analysis

Meta-analysis  should  attempt  to  evaluate  the  sensitivity  of  the 
results  to  assumptions  embedded  in  the  analysis  (National  Research 
Council, 1992). The potential bias associated with missing time
interval data was noted above. The construction of the
multivariate model therefore was followed by exploration of several
factors that might have affected the content and structure of the
model.


Gender  Coding.  The  assumptions  made  in  coding  gender  might  have 
affected  the  model.  Perhaps  gender  was  less  likely  to  be  reported  when 
the  sample  was  composed  of  males.  This  assumption  could  be  valid  if 
male  gender  was  an  implicit  default  value  in  this  research  domain. 
Samples with missing data were reclassified as male to test this
possibility.
The reclassification had little effect. Both gender and VO2max
were  significantly  related  to  SEM.  The  regression  slope  for  V02max  was 
unchanged.  The  slope  for  gender  was  .003  smaller.  The  intercept  was 
.011  larger.  These  changes  were  reasonable  given  that  the  average 
values  of  SEM,  SDt,  and  rxx  for  the  samples  with  missing  data  were  very 
similar  to  the  average  values  for  male  samples. 

Additional  Female  Data.  The  modest  amount  of  data  available  for 
females  was  a  second  concern.  The  evidence  included  only  6  samples  of 
women.  Data  from  Katch,  Sady,  and  Freedson  (1982)  were  added  to 
increase  the  total  number  of  observations  for  women.  That  study 
included  an  intensive  investigation  of  4  women.  Each  of  these  four 
women  completed  between  10  and  21  tests  with  at  least  1  day  between 
tests.  Each  woman  completed  her  series  of  tests  in  2  to  4  weeks.  The 
Katch et al. (1982) data had not been included in the analyses to this
point because the set of tests for each woman comprised a time series.
Correlations  between  errors  could  occur  that  would  lead  to 
underestimation  of  error  variance  (Ostrom,  1990) .  The  possible  lack  of 
independence  between  observations  also  raises  special  statistical 
problems  in  meta-analysis  (Becker  &  Schram,  1994) .  However,  the  data 
were  used  in  this  sensitivity  analysis  because  the  primary  objective 
was  to  improve  the  estimate  of  average  SEM  rather  than  to  make 
statistical  inferences. 

The unweighted average mean squared error for the four series was
3.2 ml • kg-1 • min-1. This value was larger than the estimated average SEM
for women in the 6 test-retest studies. In fact, this error was larger
than the estimated value for men. When these data were added, the
estimated SEM for females increased from 2.02 ml • kg-1 • min-1 to 2.30
ml • kg-1 • min-1. Although statistical inferences based on these data
must be viewed with caution, it is worth noting that the gender
difference still would be statistically significant (χ2 = 3.90, 1 df,
p < .049) if the measurement errors were treated as independent from
session to session.

Outlier/Influential  Data  Points.  A  point  noted  when  coding  the 
data  was  examined  next.  The  standard  deviation  of  V02max  had  been  coded 
in  two  prior  reviews  of  V02max  as  a  predictor  of  running  performance 
(Vickers, 2001a, 2001b). The distribution of standard deviations
indicated a typical standard deviation of ~6.00 ml • kg-1 • min-1. Very few
values were <3.00 ml • kg-1 • min-1 or >9.00 ml • kg-1 • min-1. Thus, a small
sample of studies such as that covered in this review should include
very few standard deviations beyond the range from 3 to 9 ml • kg-1 • min-1.
Other  things  equal,  values  outside  this  range  would  produce  extreme 
SEM  values. 

Analysis  of  the  standard  deviations  from  the  prior  reviews 
(Vickers,  2001a,  2001b)  gave  reason  to  believe  the  current  set  of 
studies was not broadly representative of men and women. The samples
in  this  review  accentuated  a  gender  difference  evident  in  the  larger 
body of evidence. The male standard deviation was larger than expected
(SD = 6.77 ml • kg-1 • min-1 vs. SD = 6.16 ml • kg-1 • min-1, χ2 = 59.95, 26 df, p
< .001). The female standard deviation was smaller than expected (SD =
4.31 ml • kg-1 • min-1 vs. 5.66 ml • kg-1 • min-1, χ2 = 26.52, 6 df, p < .001).
The gender difference in the data reviewed here was about 5 times what
would generally be expected (2.46 ml • kg-1 • min-1 vs. 0.50 ml • kg-1 • min-1). This
trend produced a larger χ2 for the male-female difference in the review
data than in the larger body of evidence (χ2 = 22.03 vs. χ2 = 16.07)
despite the smaller cumulative sample size in the present data.

The weighted average standard deviation for VO2max was computed
for 121 samples of men and 51 samples of women in the prior reviews.
The average standard deviations for men and women then were the points
of reference for computing z-scores for the studies in this review
(i.e., z = [ln(Sample SD) - ln(Average SD)] * sqrt(2f); f = N - 1; cf.
Raudenbush & Bryk, 2002, p. 219). The computations produced |z| > 3.00
for 6 studies. This frequency was >56 times the number that would be
expected by chance. These samples, therefore, could be classified as
outliers (Barnett & Lewis, 1978). Further analyses were undertaken to
determine the impact of the outliers on the prior analysis findings
(cf. Belsley, Kuh, & Welsch, 1980; Stevens, 1984).
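
A minimal sketch of the outlier screen follows. It assumes that the log standard deviation has approximate sampling variance 1/(2f), so that the z-score is the log difference multiplied by sqrt(2f); the square-root form is an assumption about the intended computation. The function names and the example sample are hypothetical. The second function reproduces the chance expectation behind the ">56 times" statement for 39 samples.

```python
import math


def sd_z_score(sample_sd: float, reference_sd: float, n: int) -> float:
    """z-score for a sample SD relative to a reference SD (log scale).

    Assumes var[ln(SD)] is approximately 1 / (2f) with f = n - 1.
    """
    f = n - 1
    return (math.log(sample_sd) - math.log(reference_sd)) * math.sqrt(2 * f)


def expected_extreme_count(k: int, cutoff: float = 3.0) -> float:
    """Expected number of |z| > cutoff among k independent normal z-scores."""
    return k * math.erfc(cutoff / math.sqrt(2.0))


# Hypothetical female sample: SD = 3.0 with N = 20, reference SD = 5.66.
print(round(sd_z_score(3.0, 5.66, 20), 2))

# Ratio of the 6 observed extreme samples to the chance expectation.
print(round(6 / expected_extreme_count(39), 1))
```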

The  extreme  values  were  not  randomly  distributed.  Three  of  six 
female  samples  produced  z  <  -3.00.  For  males,  two  samples  produced  z  > 
3.00;  one  sample  yielded  z  <  -3.00.  In  the  context  of  this  study,  the 
extreme  standard  deviations  strongly  suggested  that  SEM  might  be 
underestimated  for  women.  The  implications  for  men  were  less  clear, 
but  it  was  possible  that  SEM  was  overestimated  for  males. 

Each  gender  analysis  described  above  was  repeated  after  removing 
the  extreme  samples.  When  this  was  done,  men  and  women  had  virtually 
identical SEM values (χ2 < 0.40). The removal did not affect the
VO2max-SEM relationship. This association remained positive and
statistically significant.

Re-examination  of  Exercise  Mode.  The  effect  of  exercise  mode  on 
SEM  was  reexamined  to  complete  the  sensitivity  checks.  The  question 
was  whether  exercise  mode  was  related  to  SEM  controlling  for  V02max  and 
gender.  This  analysis  was  not  based  on  any  prior  finding.  The  question 
was  posed  to  check  on  a  logical  possibility  that  would  be  important  if 
true.  The  initial  analysis  covered  all  of  the  exercise  modes.  The 
analysis  was  repeated  for  the  subset  of  studies  involving  either  the 
treadmill or cycle ergometer protocols. In each analysis, the χ2 for
exercise mode was less than would be expected by chance (i.e., χ2/df <
1.00).
Revised Model


The  sensitivity  analyses  suggested  that  the  gender  effects  in 
Equation  1  were  questionable.  Therefore,  a  final  predictive  model  for 
SEM  was  constructed  with  V02max  as  the  only  predictor.  The  model  was 

ln(SEM) = 0.531 + (.009 * VO2max) (Equation 2a)

This equation yields SEM = 2.23 ml • kg-1 • min-1 when VO2max = 30
ml • kg-1 • min-1, 2.67 ml • kg-1 • min-1 when VO2max = 50 ml • kg-1 • min-1, and
3.19 ml • kg-1 • min-1 when VO2max = 70 ml • kg-1 • min-1. When the samples with
exceptional standard deviations were removed, the equation was

ln(SEM) = 0.661 + (.006 * VO2max) (Equation 2b)

The relationship to VO2max remained significant (r = .23, χ2 = 4.43, 1
df, p < .036). The revised equation yields SEM estimates of 2.32 when
VO2max = 30 ml • kg-1 • min-1, 2.61 when VO2max = 50 ml • kg-1 • min-1, and
2.95 when VO2max = 70 ml • kg-1 • min-1. Equation 2b can be seen as a robust
version of Equation 2a because the outlier data points have been given
less weight (Rousseeuw & Leroy, 1987).
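
The SEM values quoted for Equations 2a and 2b can be reproduced with a few lines of code. The sketch below simply exponentiates the linear predictor; the function name and default arguments are illustrative choices, with the defaults set to the Equation 2b coefficients.

```python
import math


def predict_sem(vo2max: float, intercept: float = 0.661, slope: float = 0.006) -> float:
    """Predicted SEM (ml/kg/min) from sample mean VO2max.

    Defaults are the Equation 2b coefficients; pass intercept=0.531 and
    slope=0.009 for Equation 2a.
    """
    return math.exp(intercept + slope * vo2max)


for vo2 in (30, 50, 70):
    eq_2a = predict_sem(vo2, intercept=0.531, slope=0.009)
    eq_2b = predict_sem(vo2)
    print(f"VO2max = {vo2}: Eq. 2a SEM = {eq_2a:.2f}, Eq. 2b SEM = {eq_2b:.2f}")
```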

Discussion 

This  review  of  the  available  evidence  regarding  the  measurement 
precision  of  V02max  tests  provides  a  basis  for  addressing  several 
general  issues.  First,  the  review  produced  reasonably  firm  conclusions 
about  some  factors  that  might  influence  precision.  Second,  the  review 
defined  some  topics  for  continuing  research  by  showing  that  additional 
evidence  is  needed  to  reach  conclusions  about  other  factors  that  might 
affect  precision.  Third,  the  review  produced  a  simple  model  for 
estimating  precision  in  studies  that  lack  repeated  measures.  SEM 
estimates  from  this  model  can  be  useful  when  evaluating  findings  from 
past  and  future  studies.  Finally,  the  evidence  illustrated  that  the 
standard  error  for  a  test  should  be  the  preferred  statistical  index  of 
test  performance.  These  points  are  discussed  below. 

Firm  conclusions  could  be  reached  regarding  3  factors  that  might 
influence  the  precision  of  V02max  measurements.  Two  findings  are 
negative.  Age  does  not  affect  precision.  Treadmill  and  cycle  ergometer 
tests  have  equal  precision.  The  third  finding  provides  the  basis  for  a 
model  to  estimate  SEM.  Precision  is  lower  when  V02max  is  higher.  Results 
obtained  from  the  analysis  of  sample  statistics  must  be  extrapolated 
to  individuals  to  reach  this  third  conclusion,  but  the  extrapolation 
seems  reasonable.  The  applied  uses  of  a  model  based  on  these 
conclusions  are  considered  after  summarizing  the  negative  findings. 

The evidence regarding 3 other potential influences on VO2max test
precision was ambiguous. Outliers made it impossible to estimate
gender effects with precision. Lack of replication made it impossible
to  decide  whether  tests  involving  alternative  exercise  methods  (e.g., 
tethered  swimming)  produced  larger  than  average  SEM  values.  Two 
methods  have  been  tested  and  both  produced  larger  than  average  SEM 
values.  Neither  result  has  been  replicated  to  date.  Also,  those  two 
methods  are  not  necessarily  representative  of  the  universe  of 
alternatives  to  treadmill  and  cycle  ergometry  protocols.  Finally,  SEM 
probably  increases  as  the  time  between  measurements  increases,  but 
this position cannot be adopted with certainty. The weak time trend
shown in Table 2 and preliminary analyses showing larger SEM values in
studies with intervals in excess of 5 weeks support the presence of a
time  effect.  However,  time  interval  could  not  be  coded  for  a  subset  of 
studies.  The  studies  in  that  subset  had  small  SEM  values.  Their 
distribution  along  the  time  dimension  could  dramatically  affect  any 
temporal  trend. 

The  established  facts  generate  a  simple  model  of  V02max  test 
precision.  Precision  decreases  as  V02max  increases.  Application  of  the 
model involves two steps. First, compute the predicted natural logarithm
of SEM, y = 0.661 + (.006 * VO2max). Second, compute the estimated SEM,
SEM' = exp(y). The areas of uncertainty discussed in the preceding
paragraph  make  it  likely  that  this  simple  model  is  incomplete.  There 
is  a  strong  likelihood  that  a  complete  model  would  include  time 
interval  between  measurements.  It  is  less  likely,  but  still  possible, 
that  a  complete  model  also  would  include  gender.  However,  the  current 
model  is  based  on  the  only  association  definitely  supported  by  the 
available  evidence. 

The SEM estimates derived from the predictive model provide a
frame of reference for evaluating 2 types of research results. The
first type evaluates methods of assessing cardiorespiratory fitness.
In this context, the model estimates provide a benchmark for new VO2max
protocols. When a new protocol is being evaluated, Equation 2b can be
applied to estimate the treadmill or cycle ergometer SEM for the study
sample. The observed SEM can be compared with the estimate by
computing z = [ln(SEM) - ln(SEM')] * sqrt(2N - 4) (cf. Raudenbush &
Bryk, 2002, p. 219). Standard statistical criteria (p < .05, one-tailed)
can be applied to decide whether the SEM for the new protocol exceeds
that for the reference standards. The model predictions also can have a
role in the validation of field tests of cardiorespiratory fitness. In
this context, VO2max test results can be regressed on field test
performance (e.g., run time) to obtain a standard error of estimate
(SEE). If the z-score computations show that the field test is less
precise than the laboratory test, as would be expected, the increase
in error associated with the field test can be estimated by computing
sqrt(SEE_Field^2 - SEM_Lab^2).
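
The two computations in this paragraph can be sketched as below. The z-score is written on the log scale with multiplier sqrt(2N - 4); that form is an assumption, chosen to match the log transformation for standard deviations cited from Raudenbush and Bryk, and it should be checked against that source before use. The function names and example values are hypothetical.

```python
import math


def sem_z_score(observed_sem: float, predicted_sem: float, n: int) -> float:
    """Compare an observed SEM with the model estimate SEM' (assumed form).

    Assumes z = [ln(SEM) - ln(SEM')] * sqrt(2N - 4).
    """
    if n <= 2:
        raise ValueError("n must exceed 2")
    return (math.log(observed_sem) - math.log(predicted_sem)) * math.sqrt(2 * n - 4)


def added_field_error(see_field: float, sem_lab: float) -> float:
    """Error added by a field test: sqrt(SEE_field^2 - SEM_lab^2)."""
    diff = see_field ** 2 - sem_lab ** 2
    if diff < 0:
        raise ValueError("field SEE is smaller than laboratory SEM")
    return math.sqrt(diff)


# Hypothetical new protocol: observed SEM 3.4 in 25 participants whose mean
# VO2max is 50, so Equation 2b gives SEM' = exp(0.661 + 0.006 * 50).
predicted = math.exp(0.661 + 0.006 * 50)
print(round(sem_z_score(3.4, predicted, 25), 2))

# Hypothetical run test with SEE = 5.0 against a laboratory SEM of 3.0.
print(added_field_error(5.0, 3.0))  # 4.0
```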

The  second  general  application  of  the  SEM  model  involves 
correcting  for  the  effects  of  measurement  error  when  studying 
relationships between VO2max and other variables. SEM' estimates from
Equation 2 can be used to compute rxx = 1 - SEM'^2/SD^2. Inserting the
computed rxx into the formula ρxy = rxy/sqrt(rxx) provides an estimate
of the true population correlation, ρxy, by correcting the observed
correlation, rxy, for unreliability of the x variable, VO2max. The
computation treats the y variable as though it were measured without
error (i.e., ryy = 1.00). By doing so, the correction focuses
specifically on the effects of measurement errors in VO2max. The
correction is not limited to correlational studies. Similar
adjustments can be applied to a wide range of analysis procedures by
transforming group differences into effect size estimates comparable
to rxy. For example, the difference between two groups in an experiment
can be expressed as a point biserial correlation (cf. Hedges & Olkin,
1985). The effects of these corrections can be substantial (Appendix
D).
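
As an illustration of the correction, the sketch below chains the two formulas: reliability from the model-based SEM' and the sample SD, then the disattenuated correlation. The function names and the example study values (SD = 6.0, observed r = .40) are hypothetical; only the Equation 2b coefficients come from the text.

```python
import math


def reliability_from_sem(sem: float, sd: float) -> float:
    """rxx = 1 - SEM^2 / SD^2."""
    rxx = 1.0 - sem ** 2 / sd ** 2
    if rxx <= 0:
        raise ValueError("SEM is too large relative to SD")
    return rxx


def disattenuate(r_xy: float, r_xx: float) -> float:
    """Correct an observed correlation for unreliability in x only."""
    return r_xy / math.sqrt(r_xx)


# Hypothetical study: mean VO2max = 50, SD = 6.0, observed r = .40.
sem_prime = math.exp(0.661 + 0.006 * 50)   # Equation 2b
rxx = reliability_from_sem(sem_prime, 6.0)
print(f"rxx = {rxx:.3f}, corrected r = {disattenuate(0.40, rxx):.3f}")
```

With a smaller sample SD, as in a homogeneous athlete group, the same SEM' yields a lower rxx and therefore a larger correction.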


The  potential  applications  of  SEM  estimates  are  important.  For 
example,  consider  a  study  relating  V02max  to  some  other  variable  (e.g., 
a  training  method,  an  ergogenic  aid)  in  a  small  sample  of  endurance 
athletes. The true effect size is likely to be underestimated.
Selection processes that determine who becomes an endurance athlete
are  likely  to  cause  true  differences  in  V02max  to  be  smaller  than  in  the 
general  population.  At  the  same  time,  the  high  average  level  of  V02max 
implies  higher  than  average  SEM.  The  combination  of  restricted  true 
score  variance  and  large  error  variance  will  yield  attenuated 
estimates  of  associations  between  V02max  and  other  variables.  Tests  of 
statistical  significance  that  combine  small  effect  sizes  with  small 
samples  have  low  statistical  power  and  are  not  likely  to  reject  the 
null  hypothesis.  Even  moderate  to  strong  true  effects  can  fail  to 
reach  statistical  significance  under  these  conditions.  Using  the  SEM 
model  to  estimate  the  effect  of  measurement  error  in  such  a  study 
could reduce the risk of dismissing promising lines of work
prematurely.

The  review  findings  also  demonstrated  that  SEM  is  preferable  to 
other  statistical  indices  when  evaluating  the  measurement 
characteristics  of  V02max  protocols.  A  distinction  between  absolute  and 
scaled  indices  of  test  performance  is  the  key  issue  here.  SEM 
quantifies  the  reproducibility  of  individual  V02max  values  in  absolute 
terms.  SEM  is  the  average  expected  error.  Other  widely  used  statistics 
scale  SEM  by  expressing  it  relative  to  some  sample  characteristic. 
Scaling is most evident for the coefficient of variation (CV). CV
expresses SEM as a percentage of average V02max (i.e., CV =
100 * SEM / Average). Test-retest reliability, rxx, scales SEM relative
to the sample standard deviation. This reliability index is defined as
the ratio of true score variance to total variance (i.e., rxx =
SDt^2 / SD^2; Nunnally & Bernstein, 1994). Because SD^2 = SDt^2 + SEM^2,
it is also true that rxx = 1 - SEM^2 / SD^2. Both rxx and CV are
therefore composite statistics in that each combines SEM with a sample
characteristic. The analysis of SDt in this review demonstrated that
the sample characteristics sometimes are associated with factors when
SEM is not (cf. Table 1). The discussion of Equation 2 showed that, in
the present data, CV can decrease when SEM is increasing. In both
cases, variations in the composite indices are a poor guide to the
precision of test scores. In the final analysis, experts regard
reproducibility of test results as the essence of reliability (e.g.,
American Psychological Association, 1994). SEM is the best index for
assessing reliability because it separates test precision from
population attributes.
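
The relationships among SEM, rxx, and CV described above can be sketched in a few lines of code. The two samples below are hypothetical and were chosen only to show that the same SEM can produce quite different values of the composite indices; the function names are likewise illustrative.

```python
import math

def reliability_from_sem(sem, sd):
    """rxx = 1 - SEM^2 / SD^2 (ratio of true score variance to total variance)."""
    return 1.0 - (sem ** 2) / (sd ** 2)

def sem_from_reliability(rxx, sd):
    """Invert the relation above: SEM = SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1.0 - rxx)

def coefficient_of_variation(sem, mean):
    """CV expresses SEM as a percentage of the sample mean."""
    return 100.0 * sem / mean

sem = 2.7  # the same absolute error in both hypothetical samples
for label, sd, mean in [("heterogeneous sample", 8.0, 45.0),
                        ("homogeneous athletes", 4.0, 65.0)]:
    rxx = reliability_from_sem(sem, sd)
    cv = coefficient_of_variation(sem, mean)
    print(f"{label}: SEM={sem:.1f}, rxx={rxx:.2f}, CV={cv:.1f}%")
```

In this sketch rxx falls and CV improves when moving to the homogeneous, high-scoring sample even though the precision of an individual score is unchanged.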

The  volume  and  quality  of  evidence  must  be  considered  when 
evaluating  the  points  made  in  this  discussion.  Publication  bias  occurs 
when only statistically significant results are published (National
Research Council, 1992). This practice inflates parameter estimates
because studies that produce smaller values are missing from the
published record. This form of bias is unlikely in the present case.
In this domain, significance tests would be expected to evaluate the
null hypothesis that rxx = .00. Given the average value of rxx for the
studies reviewed here (i.e., rxx = .91) and a sample size of 15 (i.e.,
the median for the studies reviewed), the null hypothesis will be
rejected 99.99% of the time. Publication bias therefore does not
appear likely to have had a major effect on the present findings.
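
A rough check of the power argument above can be made with the Fisher z approximation. This is only a sketch: the approximation, the two-tailed critical value of 1.96, and the function names are assumptions, so the result need not match the quoted percentage to the last digit. It does indicate that power is well above 99% under these conditions.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_power_r(rho, n, z_crit=1.96):
    """Approximate power of a two-tailed test of H0: rho = 0.

    Uses the Fisher z transformation, under which atanh(r) is roughly
    normal with standard error 1 / sqrt(n - 3). The negligible rejection
    probability in the opposite tail is ignored.
    """
    noncentrality = math.atanh(rho) * math.sqrt(n - 3)
    return normal_cdf(noncentrality - z_crit)

print(f"{approx_power_r(0.91, 15):.4f}")
```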

Another  trend  in  the  evidence  may  appear  to  be  a  cause  for 
concern.  The  analyses  identified  15%  of  the  studies  as  outliers.  This 
rate  is  not  exceptional.  Outliers  commonly  comprise  10%  to  20%  of  the 
data  in  fields  as  diverse  as  behavioral  research  and  particle  physics 
(Hedges, 1987; Hedges & Olkin, 1985). Furthermore, the conclusions
drawn  from  the  evidence  have  taken  account  of  the  outliers.  The 
evidence  regarding  gender  effects  was  treated  as  inconclusive  because 
the  outlier  data  points  affected  this  element  of  the  analyses. 

To  summarize,  this  review  provided  a  simple  model  of  the 
measurement  precision  (i.e.,  SEM)  of  V02max  tests.  The  model  is  probably 
incomplete,  but  it  represents  the  combined  evidence  from  available 
studies  of  V02max  reliability.  Further  research  on  the  effects  of  time 
interval  between  tests  and  gender  differences  in  SEM  could  refine  this 
initial  model.  The  model  generates  SEM  estimates  that  can  be  used  as 
benchmarks  when  evaluating  new  V02max  protocols.  The  SEM  estimates  also 
can  be  applied  to  correct  for  measurement  error  when  estimating 
associations  between  V02max  and  other  variables.  From  a  statistical 
perspective,  the  evidence  indicates  that  SEM  provides  a  better 
indication of test performance than rxx or CV when evaluating V02max
protocols. This summary sketch of the available evidence can be a
framework for future studies of the measurement properties of V02max
protocols.

Appendix B

Data  Coded  from  Studies 


Table B-1. Descriptive Data

Sr. Author      1     2     3     4     5     6     7     8     9     10
Aunola          33    1     1     2     33.0  7.8   177   4.9   75.6  7.5
Babcock         79    1     1     .     50.1  13.9  180   6.0   83.9  12.2
Bar-Or          41    1     3     3     28.2  8.8   174   9.8   70.8  13.2
Boileau         21    1     2     2     12.8  1.1   159   13.3  49.1  12.8
Boileau         21    1     1     2     12.8  1.1   159   13.3  49.1  12.8
Brandon         26    1     2     .     26.7  4.4   178   4.6   69.5  7.9
Braun           12    1     1     4     21.9  1.8   178   11.9  72.6  8.3
Conley          12    .     1     .     68.8  5.9   .     .     72.1  2.1
Cunningham 76   15    1     1     1     10.6  .3    141   5.6   35.5  5.4
Cunningham 77   66    1     2     4     10.4  .3    140   6.5   33.5  5.3
De Meersma      9     1     1     4     20.0  .     .     .     .     .
De Vito         6     1     2     2     27.0  5.0   176   8.0   69.0  9.0
Farrell         18    1     2     .     28.0  9.0   180   6.7   70.2  8.1
Fielding        17    2     2     2     59.0  4.1   162   4.5   62.5  8.2
Foster          8     2     2     2     79.8  4.6   159   3.9   58.4  8.0
Froehlicher     15    .     2     3     32.0  178   18.0
Froehlicher     15    .     2     3     .     178   78.0
Froehlicher     15    .     2     3     .     178   78.0
Harrison 1      9     1     2     4     31.0  71.2  69.9
Harrison 2      5     1     2     2     31.1  .     .
Harrison 3      9     1     2     .     .     .     .
Harrison 4      10    1     2     1     .     .     .
Hazard          7     2     2     3     18.4  1.1   164   4.1   48.8  3.1
Hazard          21    1     2     3     19.0  2.3   175   5.1   61.2  7.2
Huhn            20    1     2     1     25.9  2.8   179   7.1   72.2  7.4
Jackson         156   1     2     4     45.6  5.0   .     .     82.3  13.6
Jackson         43    2     2     4     44.2  8.9   .     .     63.4  12.0
Hatch           36    2     2     2     20.8  1.4   163   6.6   58.9  6.8
Kohrt           13    1     2     .     29.5  4.8   .     .     69.8  5.6
Kohrt           13    1     1     .     29.5  4.8   .     .     69.8  5.6
Kyle            17    1     2     1     31.9  4.6   181   6.3   78.0  9.7
Laukkanen       25    1     2     4     41.4  .     .     .     80.8  9.3
Laukkanen       26    1     2     4     41.4  .     .     .     84.0  10.4
Laukkanen       28    2     2     4     40.9  .     .     .     66.8  8.9
Laukkanen       29    2     2     4     40.9  .     .     .     68.6  8.6
MacSween        25    3     2     1     28.6  7.3   173   10.0  69.8  13.8
Magel           17    .     3     1     19.8  1.0   181   6.1   76.7  7.4
McArdle         41    2     2     1     20.9  1.3   163   6.0   58.2  6.9
Miller          5     1     1     .     26.7  5.8   178   5.3   73.9  7.1
Montgomery      10    .     2     .     24.8  4.0   172   7.1   75.1  15.9
Paterson        8     1     2     1     11.4  .     147   7.4   36.9  7.5
Pawelcyzk       10    3     .     4     22.8  3.3   176   10.1  71.7  17.1
Pivarnik        32    2     2     2     13.7  1.5   157   6.0   53.7  9.3
Shaver          10    1     2     .     22.5  3.2   173   6.7   75.5  13.3
Sproule         20    1     2     3     23.0  3.3   171   8.3   60.8  7.8
Thomas          24    1     2     1     61.7  .     175   .     78.7  .
Turley          9     3     2     .     9.8   .     .     .     .     .
Turley          9     3     1     .     9.8   .     .     .     .     .
Walters         10    2     2     4     15.3  1.2   .     .     54.0  7.3
Ward            27    1     2     2     39.1  10.7  180   6.7   78.3  8.4
Weltman         15    1     2     2     27.2  8.2   175   7.5   69.1  8.3

Note. Columns are 1 = Sample size (N); 2 = Gender; 3 = Protocol type;
4 = Interval group; 5 = Average age; 6 = SD age; 7 = Average height;
8 = SD height; 9 = Average weight; 10 = SD weight. SD = standard
deviation.

Table B-2. V02max Statistics

Sr. Author      1      2      3      4      5      6      7      8      9
Aunola          47.60  6.20   48.20  6.50   .920   .958   5.1    2.49   1.81
Babcock         31.00  7.50   .      .      .960   .980   6.8    2.10   .
Bar-Or          30.37  9.05   31.04  9.01   .940   .969   10.2   3.08   2.21
Boileau         48.70  5.30   49.70  6.10   .870   .930   5.4    2.81   2.13
Boileau         44.90  6.30   46.30  6.60   .880   .936   6.7    3.06   2.24
Brandon         62.50  6.10   .      .      .910   .953   4.0    2.53   .
Braun           61.97  4.78   62.12  3.84   .909   .952   3.2    1.80   1.45
Conley          22.90  6.50   23.80  6.20   .959   .979   8.0    1.80   1.30
Cunningham 76   56.60  7.70   .      .      .760   .864   8.8    5.00   .
Cunningham 77   56.50  7.10   54.50  6.60   .530   .693   10.7   5.81   4.71
De Meersma      56.20  2.40   60.60  4.20   .722   .839   3.0    2.28   2.10
De Vito         68.50  5.21   64.50  6.47   .662   .797   5.7    4.38   3.49
Farrell         43.20  6.60   .      .      .950   .974   4.8    2.06   .
Fielding        27.50  4.50   28.30  5.40   .750   .857   10.8   3.27   2.55
Foster          13.10  2.00   13.40  1.80   .760   .864   9.9    1.23   .94
Froehlicher     43.90  5.70   44.60  6.20   .851   .920   6.8    3.12   2.32
Froehlicher     48.10  6.00   47.20  6.20   .941   .970   4.2    2.06   1.49
Froehlicher     43.60  4.80   43.30  5.70   .620   .765   8.6    4.12   3.29
Harrison 1      63.70  9.04   61.83  7.18   .887   .940   6.6    3.75   3.01
Harrison 2      58.66  9.90   59.84  10.27  .955   .977   5.0    3.00   2.15
Harrison 3      60.36  8.94   61.59  10.67  .916   .956   6.0    3.94   3.08
Harrison 4      58.09  7.36   58.78  9.34   .898   .946   5.6    3.67   3.00
Hazard          54.44  4.53   58.96  5.47   .769   .869   5.3    3.20   2.48
Hazard          64.98  4.86   68.71  5.56   .786   .880   4.6    3.22   2.46
Huhn            58.28  6.09   58.11  6.68   .960   .980   2.9    1.79   1.34
Jackson         37.20  7.30   37.20  7.00   .660   .795   14.7   5.37   4.17
Jackson         30.10  7.10   27.80  6.40   .855   .922   12.2   3.50   2.61
Hatch           38.90  4.60   .      .      .950   .974   3.7    1.44   .
Kohrt           60.50  5.60   .      .      .970   .985   2.3    1.36   .
Kohrt           57.90  5.70   .      .      .930   .964   3.6    2.10   .
Kyle            56.90  10.00  .      .      .950   .974   5.5    3.12   .
Laukkanen       43.50  3.50   50.20  4.90   .795   .886   4.9    2.55   2.12
Laukkanen       43.30  3.60   47.00  4.80   .646   .785   6.3    3.21   2.61
Laukkanen       37.20  5.60   41.90  6.70   .834   .909   8.3    3.39   2.61
Laukkanen       36.80  5.20   38.70  5.70   .881   .937   6.7    2.58   1.91
MacSween        50.14  8.75   .      .      .895   .945   7.8    3.90   .
Magel           55.00  4.00   55.00  3.20   .830   .907   4.1    2.01   1.58
McArdle         38.14  3.87   38.70  4.02   .909   .952   4.2    1.64   1.21
Miller          59.70  6.70   .      .      .963   .981   3.0    1.81   .
Montgomery      45.50  8.10   .      .      .907   .951   7.5    3.41   .
Paterson        58.90  6.60   60.30  4.70   .864   .927   5.6    2.84   2.45
Pawelcyzk       41.20  10.51  44.80  9.96   .939   .969   8.8    3.52   2.56
Pivarnik        41.20  5.20   40.80  4.80   .870   .930   6.2    2.47   1.82
Shaver          53.50  5.60   .      .      .920   .958   4.1    2.19   .
Sproule         51.50  6.04   .      .      .900   .947   5.1    2.63   .
Thomas          24.70  5.40   25.90  6.40   .900   .947   9.5    2.57   .98
Turley          51.70  5.90   .      .      .905   .950   4.9    2.51   .
Turley          46.20  6.70   .      .      .887   .940   6.7    3.09   .
Walters         45.15  3.62   49.34  5.19   .781   .877   5.0    2.75   2.31
Ward            53.00  9.20   53.40  9.60   .850   .919   9.1    4.95   3.65
Weltman         63.30  4.70   65.60  6.70   .710   .830   5.2    4.01   3.34

Note. Columns are 1 = Average V02max, Time 1; 2 = SD V02max, Time 1;
3 = Average V02max, Time 2; 4 = SD V02max, Time 2; 5 = Test-retest
correlation (rxx); 6 = Intraclass correlation (ICC); 7 = Coefficient of
variation (CV); 8 = Standard error of measurement (SEM); 9 = Maximum
likelihood SEM. See text for definitions.

Appendix C


Preliminary  Evaluation  of  V02max  as  a  Function  of  Test  Occasion 

A  set  of  preliminary  analyses  evaluated  the  equivalence  of  the 
first  and  second  V02max  tests.  The  general  hypothesis  was  that  the  tests 
were  equivalent.  This  general  hypothesis  could  be  tested  in  the  set  of 
24  studies  for  which  the  mean  and  standard  deviation  were  available 
for  both  tests. 

Average  Scores 

Average V02max values were highly correlated (r = .984). The
regression equation to predict the second average based on the first
average was V2' = 0.749 + 1.000*V1, where V2' indicates the predicted
V02max for the second measurement based on the first measurement, V1.
The standard errors for the coefficients were 1.394 and .031,
respectively. Thus, the 95% confidence interval (CI) for the slope
included 1.00 (CI = 0.939, 1.061). The 95% CI for the intercept
included 0.00 (CI = -1.98, 3.47). The average V02max was higher for the
first test than for the second test in 18 of 24 samples. However, the
differences were uniformly small. The weighted average was 44.19
ml • kg-1 • min-1 for the first test and 43.45 ml • kg-1 • min-1 for the
second test. The difference was not significant in any individual
sample (|t| < 1.29), nor was it significant when pooled across samples
(Z = -1.05, p > .293, method of adding ts; cf. Rosenthal, 1978).
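
For readers unfamiliar with the method of adding ts, the sketch below shows one common statement of it: the t statistics are summed and divided by the square root of the summed variances of their t distributions, df / (df - 2). The t values and degrees of freedom are hypothetical, because the per-sample statistics are not reported here, and the function name is illustrative.

```python
import math

def add_ts(t_values, dfs):
    """Combine independent t statistics into a single Z (method of adding ts).

    Z = sum(t) / sqrt(sum(df / (df - 2))); each df must exceed 2 so that
    the variance of the corresponding t distribution is finite.
    """
    if any(df <= 2 for df in dfs):
        raise ValueError("each study needs df > 2")
    return sum(t_values) / math.sqrt(sum(df / (df - 2) for df in dfs))

# Hypothetical t statistics and degrees of freedom for three samples.
z = add_ts([-0.40, -1.10, 0.25], [14, 20, 9])
print(round(z, 2))
```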

Standard  Deviations 

The sample estimates of the standard deviation were stable (r =
.89). The regression to predict the standard deviation for the second
test from that for the first test was S2' = 1.420 + 0.802*S1. The
standard errors were 0.498 and 0.070, respectively. The 95% CI for the
slope approached, but did not reach, 1.00 (CI = .665, .939). The 95%
CI for the intercept did not include 0.00 (CI = .444, 2.396). However,
these results may be related to the fact that regression models assume
that the predictor variable is measured without error. The
confirmatory factor analysis (CFA) results clearly indicated that SEM
was invariant across test occasions.
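
The confidence intervals quoted for the standard deviation regression are consistent with the usual normal-theory form, estimate plus or minus 1.96 standard errors. The sketch below assumes that form (the critical value is not stated in the text) and uses an illustrative function name.

```python
def normal_ci(estimate, std_error, z_crit=1.96):
    """Return the (lower, upper) bounds of a normal-theory confidence interval."""
    half_width = z_crit * std_error
    return estimate - half_width, estimate + half_width

slope_ci = normal_ci(0.802, 0.070)      # slope for the standard deviations
intercept_ci = normal_ci(1.420, 0.498)  # intercept for the standard deviations
print(tuple(round(b, 3) for b in slope_ci))
print(tuple(round(b, 3) for b in intercept_ci))
```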

Confirmatory  Factor  Analysis  (CFA)  Models 

LISREL 8.5 (Jöreskog & Sörbom, 1996) was used to fit CFA models
that tested 2 important hypotheses regarding the standard deviations.
The CFA models treated each V02max test as an indicator of a single
V02max latent trait. The model constrained the factor loading to be the
same for both tests. This constraint embodied the assumption that a
person's true V02max did not change between test sessions. If so, the
true score variance for the sample would be unchanged. The factor
loadings are indicators of this true score variance, so it follows
that they would be unchanged.

Table C-1. Comparison of CFA Models

Model            df  χ2      Sig.   RMSEA  p(close)  NNFI  Crit N  SRMR
Invariant        47  118.38  .0000  .296   .214      .944  264     .321
Mode Specific    45  84.83   .0004  .226   .315      .958  311     .309
Sample Specific  24  23.79   .4738  .000   .532      .997  697     .278

Note. df = degrees of freedom; RMSEA = root mean-square error of
approximation (Steiger, 1990); NNFI = non-normed fit index (Bentler &
Bonett, 1980); Crit N = critical N (Hoelter, 1983); SRMR =
standardized root mean-square residual (Jöreskog & Sörbom, 1996).

Alternative models were defined by imposing constraints on the
standard errors. Every model imposed the constraint that SEM was the
same for both tests within each sample. Different models were obtained
by varying whether equality constraints were imposed across samples.
The broadest constraint assumed that SEM was constant across all
studies. A second model assumed that SEM differed between exercise
modes, but was constant within modes. A third model assumed that each
study produced a unique SEM that was constant across test sessions for
that sample.

The LISREL analysis was limited to the 24 studies with standard
deviation data for both tests. One model constrained the error to be
the same across all tests (invariant). One model constrained the error
to be the same within exercise mode (mode specific). One model
permitted a distinct error for each sample (sample specific).

All 3 models were acceptable by several criteria (Table C-1). The
p(close) values indicated that each model was within chance of the
recommended RMSEA = .05 value. All 3 NNFI values exceeded .900. All 3
critical Ns exceeded 200.

Criteria that differed between models generally favored the
sample-specific model. First, the overall χ2 decreased significantly
moving from the invariant model to the mode-specific model (Δχ2 =
33.55, 2 df, p < .001) and then from the mode-specific model to the
sample-specific model (Δχ2 = 61.04, 21 df, p < .001). Second, the
sample-specific model was the only one for which the overall χ2 was
nonsignificant. Third, the RMSEA estimate for the sample-specific
model was .000. Fourth, the critical N for the sample-specific model
was more than twice as large as that for the mode-specific model.
Finally, the SRMR was smallest for the sample-specific model. However,
if parsimony adjustments had been introduced, the NNFI for the
sample-specific model would have been substantially lower than that
for the other two models.
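
The nested-model comparisons above amount to chi-square difference tests. The sketch below recomputes the differences from the fit statistics in Table C-1 and evaluates them with a small, self-contained chi-square tail function. The tail function uses the standard finite-sum closed forms for integer degrees of freedom; it and the variable names are illustrative and are not part of the original LISREL analysis.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc term plus a finite sum in odd powers of sqrt(x).
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total

# (chi-square, df) for each model, as listed in Table C-1.
fits = {"invariant": (118.38, 47), "mode specific": (84.83, 45),
        "sample specific": (23.79, 24)}
steps = [("invariant", "mode specific"), ("mode specific", "sample specific")]
for restricted, relaxed in steps:
    d_chi2 = fits[restricted][0] - fits[relaxed][0]
    d_df = fits[restricted][1] - fits[relaxed][1]
    p = chi2_sf(d_chi2, d_df)
    print(f"{restricted} vs {relaxed}: delta chi2 = {d_chi2:.2f}, "
          f"df = {d_df}, p = {p:.2g}")
```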

The  CFA  led  to  2  primary  conclusions.  First,  SEM  can  be  regarded 
as  invariant  across  tests.  This  result  means  that  analysis  of  the  SEM 
for  the  first  test  is  a  reasonable  basis  for  inferences  about  test 
errors.  Second,  SEM  varies  from  sample  to  sample.  This  inference  is 
supported  by  the  general  improvement  in  model  fit  when  the  sample- 
specific  model  is  compared  with  the  alternatives.  The  other  models 
would  be  adequate  by  accepted  modeling  standards  and  could  be 
preferred for their parsimony (Mulaik et al., 1989). However, an
erroneous  inference  about  the  existence  of  sample-specific  values  for 
SEM  should  not  cause  problems.  This  review  attempts  to  identify 
factors  that  affect  SEM.  If  the  sample-to-sample  variation  is  truly 
the  result  of  chance,  there  should  be  only  a  few  chance  associations 
between  SEM  and  potential  predictors.  The  modeling  attempts  should 
reinforce  the  inference  that  SEM  differences  are  chance. 

Appendix D


Using  SEM  to  Correct  Effect  Size  Estimates 

V02max  SEM  estimates  can  be  used  to  assess  the  effects  of 
measurement  error  on  estimates  of  associations  between  V02max  and  other 
variables. For example, the correlation between two variables, x and
y, is

rxy = Cxy / (Sx^2 * Sy^2)^(1/2) = Cxy / (Sx * Sy)     (Equation D-1)

Given the usual assumption that errors are uncorrelated, Cxy, the
covariance of x and y, is determined entirely by the correspondence
between true scores. Sx, the standard deviation of x, and Sy, the
standard deviation of y, are composites of true score and error
variance.

SEM can be used to correct effect size estimates because
eliminating measurement error does not affect Cxy. This parameter is
not affected because SEM, by definition, does not contribute to Cxy.
However, eliminating measurement error does reduce Sx and/or Sy. The
elimination of measurement error increases rxy because the denominator
of Equation 3 is reduced while the numerator remains constant.

SEM estimates can be used to correct for attenuation due to
measurement error. The first step is to estimate the true score
standard deviation, SDt, by computing S' = sqrt(S^2 - SEM^2). An SEM
estimate based on Equation 2 can be used for this computation.
Substituting S' into Equation 3 then provides an adjusted estimate of
rxy.

To illustrate the correction process, consider the association of
running performance with V02max. The expected association would be r =
.82 with an associated SD = 6.2 for V02max (Vickers, 2001a, 2001b).
Adopting Equation 2b as a robust estimate of the relationship between
SEM and average V02max, SEM = 2.7 at 50 ml • kg-1 • min-1. The estimated
true score standard deviation is S' = 5.6 ml • kg-1 • min-1. If V02max is
the y variable, the denominator for Equation 3, Sx*Sy', is 10% smaller
than the original denominator, Sx*Sy. Cxy remains constant, so the
smaller denominator increases the estimated run time-V02max
correlation to r = .91. Although this illustration involves a
correlation coefficient, the general approach extends to most common
measures of effect size because effect size indicators generally can
be converted into correlations (Hedges & Olkin, 1985).

The correction procedure illustrated above is equivalent to
applying a well-known equation to correct for attenuation due to
measurement error. The equation is based on rxx, but the equivalence
follows from the relationships between rxx, SDt, and SEM (cf. Nunnally
& Bernstein, 1994, pp. 260-262).
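
The worked example and the equivalence noted above can be checked numerically. The sketch below applies the SEM-based correction to the values from the illustration (r = .82, SD = 6.2, SEM = 2.7) and compares it with the classical correction based on rxx, taking rxx = 1 - SEM^2/SD^2. Function names are illustrative, and only the V02max variable is corrected.

```python
import math

def true_score_sd(sd, sem):
    """S' = sqrt(S^2 - SEM^2): the observed SD with error variance removed."""
    return math.sqrt(sd ** 2 - sem ** 2)

def correct_r_with_sem(r_xy, sd_y, sem_y):
    """Rescale r by S / S', leaving the covariance untouched."""
    return r_xy * sd_y / true_score_sd(sd_y, sem_y)

def correct_r_with_rxx(r_xy, rxx_y):
    """Classical correction for attenuation in one variable: r / sqrt(rxx)."""
    return r_xy / math.sqrt(rxx_y)

sd, sem, r = 6.2, 2.7, 0.82   # values from the illustration above
rxx = 1.0 - sem ** 2 / sd ** 2
print(round(true_score_sd(sd, sem), 1))          # 5.6
print(round(correct_r_with_sem(r, sd, sem), 2))  # 0.91
print(round(correct_r_with_rxx(r, rxx), 2))      # 0.91
```

Both routes give the same corrected value because S'/S equals the square root of rxx.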